data standardization
Lost in the Pipeline: How Well Do Large Language Models Handle Data Preparation?
Spreafico, Matteo, Tassini, Ludovica, Sancricca, Camilla, Cappiello, Cinzia
Large language models have recently demonstrated their exceptional capabilities in supporting and automating various tasks. Among the tasks worth exploring for testing large language model capabilities, we considered data preparation, a critical yet often labor-intensive step in data-driven processes. This paper investigates whether large language models can effectively support users in selecting and automating data preparation tasks. To this aim, we considered both general-purpose and fine-tuned tabular large language models. We prompted these models with poor-quality datasets and measured their ability to perform tasks such as data profiling and cleaning. We also compared the support provided by large language models with that offered by traditional data preparation tools. To evaluate the capabilities of large language models, we developed a custom-designed quality model that has been validated through a user study to gain insights into practitioners' expectations.
Revisiting Differentiable Structure Learning: Inconsistency of $\ell_1$ Penalty and Beyond
Jin, Kaifeng, Ng, Ignavier, Zhang, Kun, Huang, Biwei
Recent advances in differentiable structure learning have framed the combinatorial problem of learning directed acyclic graphs as a continuous optimization problem. Various aspects, including data standardization, have been studied to identify factors that influence the empirical performance of these methods. In this work, we investigate critical limitations in differentiable structure learning methods, focusing on settings where the true structure can be identified up to Markov equivalence classes, particularly in the linear Gaussian case. While Ng et al. (2024) highlighted potential non-convexity issues in this setting, we demonstrate and explain why the use of $\ell_1$-penalized likelihood in such cases is fundamentally inconsistent, even if the global optimum of the optimization problem can be found. To resolve this limitation, we develop a hybrid differentiable structure learning method based on $\ell_0$-penalized likelihood with hard acyclicity constraint, where the $\ell_0$ penalty can be approximated by different techniques including Gumbel-Softmax. Specifically, we first estimate the underlying moral graph, and use it to restrict the search space of the optimization problem, which helps alleviate the non-convexity issue. Experimental results show that the proposed method enhances empirical performance both before and after data standardization, providing a more reliable path for future advancements in differentiable structure learning, especially for learning Markov equivalence classes.
Exploring the Feasibility of Automated Data Standardization using Large Language Models for Seamless Positioning
Lee, Max J. L., Lin, Ju, Hsu, Li-Ta
We propose a feasibility study for real-time automated data standardization leveraging Large Language Models (LLMs) to enhance seamless positioning systems in IoT environments. By integrating and standardizing heterogeneous sensor data from smartphones, IoT devices, and dedicated systems such as Ultra-Wideband (UWB), our study ensures data compatibility and improves positioning accuracy using the Extended Kalman Filter (EKF). The core components include the Intelligent Data Standardization Module (IDSM), which employs a fine-tuned LLM to convert varied sensor data into a standardized format, and the Transformation Rule Generation Module (TRGM), which automates the creation of transformation rules and scripts for ongoing data standardization. Evaluated in real-time environments, our study demonstrates adaptability and scalability, enhancing operational efficiency and accuracy in seamless navigation. This study underscores the potential of advanced LLMs in overcoming sensor data integration complexities, paving the way for more scalable and precise IoT navigation solutions.
CleanAgent: Automating Data Standardization with LLM-based Agents
Data standardization is a crucial part of the data science life cycle. While tools like Pandas offer robust functionalities, their complexity and the manual effort required for customizing code to diverse column types pose significant challenges. Although large language models (LLMs) like ChatGPT have shown promise in automating this process through natural language understanding and code generation, using them still demands expert-level programming knowledge and continuous interaction for prompt refinement. To solve these challenges, our key idea is to propose a Python library with declarative, unified APIs for standardizing column types, simplifying LLM code generation with concise API calls. We first propose Dataprep.Clean, a component of the Dataprep Library that significantly reduces complexity by enabling the standardization of specific column types with a single line of code. We then introduce the CleanAgent framework, which integrates Dataprep.Clean and LLM-based agents to automate the data standardization process. With CleanAgent, data scientists need only provide their requirements once, allowing for a hands-free, automatic standardization process.
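To make the idea of a declarative, one-call column standardizer concrete, here is a minimal sketch using only the Python standard library. It is not the Dataprep.Clean API: the names standardize_date, clean_column, and CANDIDATE_FORMATS, as well as the sample rows, are hypothetical and chosen only for illustration.

```python
from datetime import datetime

# Hypothetical list of input formats to try; a real library would cover far more.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def standardize_date(value, target="%Y-%m-%d"):
    """Re-render a date string in the target format, or return None if unparseable."""
    text = value.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime(target)
        except ValueError:
            continue  # this format did not match; try the next one
    return None


def clean_column(rows, column, cleaner):
    """Declarative entry point: one call standardizes one column in every row."""
    return [{**row, column: cleaner(row[column])} for row in rows]


# Illustrative data only.
rows = [
    {"id": 1, "joined": "2023-01-05"},
    {"id": 2, "joined": "05/01/2023"},
    {"id": 3, "joined": "January 5, 2023"},
    {"id": 4, "joined": "not a date"},
]

# The "single line of code" a user or an LLM agent would have to produce.
cleaned = clean_column(rows, "joined", standardize_date)
```

The point of the design is that the caller states what to standardize (a column and its type) rather than how, which keeps the code an LLM has to generate short and uniform across column types.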
Structure Learning with Continuous Optimization: A Sober Look and Beyond
Ng, Ignavier, Huang, Biwei, Zhang, Kun
Bayesian networks are a class of probabilistic graphical models that encode probabilistic distributions in a compact way (Pearl, 1988; Koller and Friedman, 2009). Recovery of their graphical structures from data, represented by directed acyclic graphs (DAGs), has found applications in several fields such as genetics (Peters et al., 2017) and education (Gong et al., 2022). This problem is NP-hard in general (Chickering, 1996; Chickering et al., 2004) owing to the combinatorial space of DAGs. Classical structure learning approaches fall into two broad categories, i.e., constraint-based methods and score-based methods. Constraint-based methods, such as PC (Spirtes and Glymour, 1991), employ conditional independence tests to estimate the skeleton and further perform edge orientation up to the Markov equivalence class (MEC) (Spirtes et al., 2001). Score-based methods typically assign a score to each structure and search for a high-scoring structure in the space of DAGs or equivalence classes (Koivisto and Sood, 2004; Singh and Moore, 2005; Cussens, 2011; Yuan and Malone, 2013). These methods often adopt greedy search because of the large space of possible structures (Chickering, 1996), such as GES (Chickering, 2002) and GDS (Peters and Bühlmann, 2013). Recently, Zheng et al. (2018) proposed a smooth characterization of acyclicity and transformed the structure learning problem of discrete nature into a continuous, nonconvex optimization problem, thus enabling the application of gradient-based methods.
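The smooth acyclicity characterization of Zheng et al. (2018) is usually written as h(W) = tr(exp(W ∘ W)) − d, where W is the d × d weighted adjacency matrix and ∘ is the elementwise product; h(W) = 0 exactly when W describes a DAG. The sketch below evaluates h with a truncated power series in pure Python. The function names, the truncation length, and the example matrices are assumptions made for illustration, not part of the original work.

```python
import math


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity(w, terms=20):
    """Approximate h(W) = tr(exp(W∘W)) - d by summing tr(A^k)/k! for k = 1..terms."""
    d = len(w)
    a = [[x * x for x in row] for row in w]  # A = W∘W, elementwise square
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # A^0
    total = 0.0
    for k in range(1, terms + 1):
        power = matmul(power, a)  # now holds A^k
        total += sum(power[i][i] for i in range(d)) / math.factorial(k)
    return total


# Chain 0 -> 1 -> 2: no cycles, so every tr(A^k) is zero.
dag = [[0.0, 1.5, 0.0],
       [0.0, 0.0, -2.0],
       [0.0, 0.0, 0.0]]

# 0 <-> 1: a two-node cycle, so h is strictly positive.
cyclic = [[0.0, 1.0],
          [1.0, 0.0]]

h_dag = acyclicity(dag)
h_cyclic = acyclicity(cyclic)
```

Because h is differentiable in W, it can serve as an equality constraint in a gradient-based solver, which is what turns the combinatorial search over DAGs into a continuous (though nonconvex) problem.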
Marketing Analytics Insights Using Machine Learning
Many industry-leading companies already use data science to make better decisions and improve their marketing analytics. With expanded industry data, greater availability of resources, and lower storage and processing costs, an organization can now process large volumes of frequent, granular data with the help of several data science techniques, gaining the leverage needed to create composite models, support crucial decisions, and obtain essential consumer insight with higher accuracy than ever before. Using data science principles in marketing analytics is a cost-effective, practical way for many companies to observe a customer's behavior and journey and to contribute toward a more customized experience. In this article, we use machine learning to segment customer data, specifically data clustering, PCA, and data standardization for large-scale analytics, and derive specific marketing insights from real-life data. Customer segmentation is the process of dividing target customers into groups based on demographic or behavioral data so that marketing plans can be tailored more precisely to each group.
Data preprocessing techniques with scikit-learn
The scikit-learn library includes tools for data preprocessing and data mining. It is imported in Python via the statement import sklearn. Raw data can take on values in arbitrary ranges, which makes features hard to compare and interpret. Therefore, we should convert the data into a standard format to make it easier to understand.
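One common "standard format" is the z-score: subtract the mean and divide by the standard deviation, so the result has mean 0 and unit variance. The sketch below does this with the Python standard library only; it mirrors what scikit-learn's StandardScaler computes for a single feature (which also uses the population standard deviation), but the function name standardize and the sample values are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant feature carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Illustrative feature on an arbitrary scale.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights_cm)
```

After the transformation, features measured in different units (centimetres, kilograms, currency) sit on a comparable scale, which matters for distance-based methods such as clustering and for PCA.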
NIMML Delineates the Path for Personalized Nutrition: Challenges and Solutions
The Nutritional Immunology and Molecular Medicine Laboratory (NIMML), a leading lab at the Biocomplexity Institute of Virginia Tech, is applying artificial intelligence (AI) methods to personalized nutrition and health. These efforts are aligned with the Precision Medicine Initiative (PMI), which not only helps researchers and physicians cure people but also empowers individuals to monitor and take a more active role in their own health. As opposed to the PMI, personalized nutrition refers to tailored nutritional recommendations aimed at the promotion and maintenance of health and the prevention of disease. However, there are numerous challenges on the path to making personalized nutritional recommendations for health, well-being, and disease prevention. The "one-size-fits-all" template, based on generic nutritional recommendations for improving an individual's health, is not helpful.
Machine learning collaborations accelerate materials discovery – Physics World
In 1863, five members of the Chōshū han in Japan made a secret journey to University College London in the UK to study. At the time of their departure, travel overseas was illegal in Japan; nonetheless, all five students made an impact on the University that is commemorated to this day, and they returned to help usher in a new era in their homeland, establishing institutions such as the National Mint and the Japanese railways, with one of them becoming the first Prime Minister. In the same spirit of international collaborations fostering pioneering innovations, materials and data scientists met at the Japanese Embassy in London on Friday 21st June during the "Season of Culture" to discuss "Global Trends in Research on Data-driven Discovery in Materials Science". The event was the 10th scholarly colloquium organized by the journal Science and Technology of Advanced Materials (STAM). Developments in data present an interesting example in science diplomacy where science and technology may facilitate a diplomatic agenda that in turn serves the interests of science.
Artificial Intelligence Faces Age-Old IT Challenge: Data Standardization
"AI starts with data, and if the data is lousy, you're not going to make any great AI," said Bob Friday, co-founder and chief technology officer at AI-driven wireless network company Mist Systems, speaking on a panel about AI in IT. The data coming out of various firewalls, routers, load balancers and other devices that applications depend on varies based on vendors providing the equipment. But enterprises can make the most use of AI if the volumes of data extracted from the various elements within a network are formatted the same way regardless of origin, IT leaders said. Data standardization could be a precursor to large AI-based projects within IT infrastructure, and that conversation has already been going on within the IT community for some time. "Two years ago we were asking ourselves will we ever get to a standardized place … it sounds like that's still not settled," said Neal Secher, senior vice president and head of networks and data center modernization at State Street Corp., at the panel event.